1 Introduction

The data tracking problem is one of the most basic problems in Internet research. Assume we have
a set {1, ..., m} of objects and a set {1, ..., n} of nodes. As shown in Figure 1, we treat the whole Internet as a
communication service in which any two nodes i and j can communicate, but at some
cost c(i, j). In these notes, when we say that two nodes i and j in the network are neighbors, we mean that i and j are
near each other by some logical measure such as latency; they need not be next-hop neighbors.




Figure 1: The Internet as a communication service 

We also assume that each node contains a set of copies of objects. Now, our goal is to find a (nearby) 
copy of the requested object, insert or delete object copies and finally update control information as nodes 
join or leave the system. More precisely, we have the following operations. 

1. find(u,x) in which node u issues a request to locate and obtain a copy of object x; 

2. insert(u,x) in which node u inserts a new copy of object x into its storage space; 

3. delete(u,x) in which node u deletes an existing copy of object x from its storage space; 

4. join(u) in which node u joins the system; and 

5. leave(u) in which node u leaves the system. 

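The five operations above can be modeled by a brute-force reference implementation. The sketch below is not a distributed protocol: it keeps global state in one place and only pins down what each operation is expected to do. The class name `ReferenceTracker` and the `cost` callback are illustrative assumptions, not part of the notes.

```python
class ReferenceTracker:
    """Centralized reference model of find/insert/delete/join/leave."""

    def __init__(self, cost):
        self.cost = cost      # cost(i, j): assumed communication cost between nodes
        self.copies = {}      # node -> set of objects in its public storage

    def join(self, u):
        self.copies.setdefault(u, set())

    def leave(self, u):
        # Every copy held by u becomes unavailable when u leaves.
        self.copies.pop(u, None)

    def insert(self, u, x):
        self.copies[u].add(x)

    def delete(self, u, x):
        self.copies[u].discard(x)

    def find(self, u, x):
        """Return the node nearest to u that holds a copy of x, or None."""
        holders = [v for v, objs in self.copies.items() if x in objs]
        return min(holders, key=lambda v: self.cost(u, v), default=None)
```

A real scheme has to answer find without this global view, which is where the control information discussed below comes in.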
We note that the insert and delete operations refer to the nodes' public storage space. Thus, insert(u,x)
makes a new copy of x at u available for access to all the network nodes, while delete(u,x) makes an existing
available copy of x at u unavailable to the rest of the network. We also note that a join or leave
operation is quite likely to be followed by a series of insert or delete operations.

In tracking distributed objects, we assume that all nodes are equal and have the same power. In particular,
there are no central or special nodes that are more powerful than the others. The main reason behind
this assumption is that in real distributed systems like the Internet, nodes can join or leave very frequently
and often we cannot assume that some nodes are very powerful. In addition, this assumption makes our
system more fault-tolerant: if a node shuts down abruptly, we simply have a leave operation
and the whole system can adapt very easily.

The data tracking problem has many applications. First, consider the DNS, in which we need to map
names to IP addresses. In this system, our objects are IP addresses whose copies are stored at some nodes
of the Internet. Given a name, we want to find a copy of the object which contains its IP address. A second
application is in peer-to-peer networks, where each node acts as both a client and a server, and we need to
provide efficient access to shared data using lightweight procedures. Again, the basic operations in these networks
can be considered special cases of the operations mentioned above. Other examples are replicated servers,
in which the join and leave operations may be ignored, and tracking mobile users, in which we have only
one copy of each object (each node contains a unique object) [AP95].

Currently popular peer-to-peer file sharing systems include Napster, Gnutella and
Freenet. We discuss Gnutella and Freenet further in the next two subsections. Selected academic research
projects in this area are Oceanstore at the University of California at Berkeley, conducted by Kubiatowicz
et al., Chord at the Massachusetts Institute of Technology, conducted by Stoica et al., and Content
Addressable Network (CAN), again at the University of California at Berkeley, conducted by Ratnasamy
et al.

1.1 Gnutella 

The Gnutella network is a decentralized, unmanaged system for sharing, searching, and acquiring files. The
Gnutella network supports sharing and searching of any file type, unlike Napster, which only allows users
to share certain types of files, namely MP3s. However, Gnutella does not offer any extra functionality, such as
chatting. Gnutella is a peer-to-peer system, with client software that also acts as a server. Gnutella was
created by a group of developers at Nullsoft, a subsidiary of America Online.

Suppose that a node u wants to find an object, such as a web page which contains some keywords. Gnutella
works by a flooding protocol in which node u sends its request to all its neighbors, and these neighbors
forward the request to all their neighbors, until the request reaches a node v containing the object;
node v then returns a copy of the object to u (see Figure 2). In this system, each request is satisfied by
a nearby copy, and thus it is efficient in terms of the find cost. The flooding process is controlled to some extent:
each node has a searchable index in a database, which prevents flooding along some wrong
paths. However, this system is not scalable, and in the worst case the entire network may be flooded. To
eliminate loops, the system also uses a time-to-live (TTL) field, which may prevent excessive flooding.

Figure 2: Finding an object in Gnutella

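The flooding search with a TTL can be sketched as a breadth-first traversal. This is a sequential simulation under stated assumptions, not the Gnutella wire protocol: `neighbors` is a hypothetical adjacency mapping, `has_object` is a hypothetical predicate, and the `seen` set stands in for whatever duplicate suppression a real client performs.

```python
from collections import deque

def flood_find(neighbors, has_object, start, ttl):
    """Flood a request from `start`; return (holder or None, messages sent)."""
    seen = {start}
    queue = deque([(start, ttl)])
    messages = 0
    while queue:
        node, remaining = queue.popleft()
        if has_object(node):
            return node, messages
        if remaining == 0:
            continue                      # TTL exhausted: do not forward further
        for nb in neighbors[node]:
            messages += 1                 # one request message per forwarded copy
            if nb not in seen:            # assumed duplicate suppression
                seen.add(nb)
                queue.append((nb, remaining - 1))
    return None, messages
```

Because the traversal expands in order of hop count, the first holder reached is one that is nearest in hops, while the message count can grow with the size of the whole neighborhood within the TTL.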


1.2 Freenet 

Free Network (Freenet) is a large-scale peer-to-peer decentralized network which pools the power of member 
computers around the world to create a massive virtual information store open to anyone to publish or 
view information of all kinds freely. Freenet dynamically replicates and relocates information in response to 
demand to provide efficient service and minimal bandwidth usage regardless of load. In addition, Freenet is 
private and secure, i.e. information stored in Freenet is protected by strong cryptography to guard against 
malicious tampering or counterfeiting. This network is an enhanced open-source implementation of the
system described in Ian Clarke's 1999 paper "A distributed decentralized information storage and retrieval
system." The first version (Version 0.1) of this system was released in March 2000.

To find an object, Freenet uses a depth-first search (DFS) of the network, which results in a sequential version of flooding
in the worst case (Figure 3 illustrates a sequence of query and response messages in Freenet). Here we have
a trade-off between efficiency and scalability: a request may have to be forwarded along a long chain of
nodes before being satisfied, but a single request causes little congestion.

Figure 3: Finding an object in Freenet

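The depth-first search can be sketched in the same style. This is only an illustration of the search order, assuming a hypothetical adjacency mapping `neighbors`, a hypothetical predicate `has_object`, and a hop budget `hops_to_live`; neighbors are tried in list order here, which is an assumption of the sketch.

```python
def dfs_find(neighbors, has_object, start, hops_to_live):
    """Sequential depth-first search; return the query path to a holder, or None."""
    visited = set()

    def visit(node, budget):
        visited.add(node)
        if has_object(node):
            return [node]
        if budget == 0:
            return None
        for nb in neighbors[node]:
            if nb not in visited:
                path = visit(nb, budget - 1)
                if path is not None:
                    return [node] + path
        return None                       # every neighbor failed: backtrack

    return visit(start, hops_to_live)
```

Only one request is in flight at any time, which is why congestion stays low, but the returned path can be much longer than the distance to the nearest copy.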


1.3 Measures 

Communication cost between two nodes is an idealized function of latency, bandwidth, queue size, etc. For
analysis, we assume that our cost function is static and forms a metric. The cost of an operation is the total
communication cost of the messages sent to carry out the operation.

Let v be a node nearest to u that has a copy of x. We define the stretch of a find(u,x) operation as the ratio
$\dfrac{\text{cost of find}(u,x)}{c(u,v)}$.
The stretch of insert and delete operations is defined similarly [BFR95]. For join and leave operations, we
use the number of nodes that are updated as a measure instead of communication cost.

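As a small worked example, reading the stretch of a find as the total cost of the messages it sends divided by the cost of reaching the nearest copy directly, the computation is a single ratio. The function name and arguments below are illustrative assumptions.

```python
def stretch(message_costs, nearest_copy_cost):
    """Total message cost of an operation divided by the cost c(u, v) to the nearest copy."""
    if nearest_copy_cost <= 0:
        raise ValueError("cost to the nearest copy must be positive")
    return sum(message_costs) / nearest_copy_cost
```

For instance, a find that sends messages of cost 2, 4 and 3 while the nearest copy is at cost 3 has stretch 9 / 3 = 3.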
Memory overhead is defined to be the maximum amount of control information stored at a node of the
network. Here, by control information, we mean the list of nodes to which a node forwards requests or the list
of objects of which the node is aware. We also define the static load of a node to be the number of objects it
is aware of, and the dynamic load of a node to be the number of find operations affecting the node per unit time.



2 Algorithms for data tracking 

In this section, we introduce three algorithms for efficient data tracking. The first one is a simple algorithm
which presents the main ideas. The second algorithm is due to Awerbuch and Peleg [AP90]. Awerbuch and
Peleg motivated the idea of locality preservation: given that a task concerns only a subset of the sites
located in a small region of the network, one would like the execution of the task to involve only sites in or
around that region, and the cost of the task to be proportional to its locality level. The concepts of sparse
neighborhood covers and hierarchical clustering decompositions (see Subsection 2.2) were first introduced in
this paper. Using these concepts, they prove near-optimal bounds on the stretch factor. The main
disadvantage of the algorithm appears when we have join and leave operations. The third algorithm, which is
a simpler flat tracking scheme, is due to Plaxton et al. [PRR99]. The paper considers partial locality
preservation and static load balancing, and the scheme forms the data location component of Oceanstore (see Subsection
2.3). Due to time constraints, we were unable to cover recent work on data tracking algorithms, including
the Chord and CAN projects.

2.1 A simple algorithm: a tree-based distributed tracking 

One naive approach to data tracking is to embed a tree into the network and have each node inform all
its ancestors about its own objects (see Figure 4). For find(u,x), we forward the request along the
path from u to the root of the tree until we reach the first ancestor that holds control information about
object x, and then we retrieve the object from an appropriate node. In effect, the tree and its embedding
determine the location of control information among the network nodes. This approach has some problems.
The first is that the root of the tree must have control information about all objects. In other words, we
have memory overhead O(m). The second problem is that the embedding may not respect network locality.
For example, consider a ring of nodes 0, ..., n - 1 in which c(i, j) is the number of hops between nodes i and j
in the ring plus one (the cost function is a metric). One can observe that in every tree embedding there are
two nodes at distance one in the ring whose distance in the tree is Ω(n). In the next subsection, we
consider the sparse neighborhood covers introduced by Awerbuch and Peleg [AP90] to overcome the second
problem. The main idea is to embed multiple "tree-like structures" in the network. This idea
was considered further by Bartal et al. in their paper on hierarchically well-separated trees [Bar98].

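The tree-based scheme can be sketched as follows. The sketch assumes the embedded tree is given as a hypothetical `parent` mapping (the root maps to `None`) and stores, at every node, one known holder per object; the class and method names are illustrative.

```python
class TreeTracker:
    """Each node's directory lists objects held somewhere in its subtree."""

    def __init__(self, parent):
        self.parent = parent                           # node -> parent, root -> None
        self.directory = {u: {} for u in parent}       # node -> {object: holder}

    def _ancestors(self, u):
        while u is not None:                           # u itself, then up to the root
            yield u
            u = self.parent[u]

    def insert(self, u, x):
        for a in self._ancestors(u):                   # inform all ancestors of u
            self.directory[a].setdefault(x, u)

    def find(self, u, x):
        """Climb from u toward the root; return (holder or None, nodes visited)."""
        path = []
        for a in self._ancestors(u):
            path.append(a)
            if x in self.directory[a]:
                return self.directory[a][x], path
        return None, path
```

The root's directory ends up with an entry for every object in the system, which is exactly the memory overhead problem noted above.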



Figure 4: Tree-based distributed tracking

2.2 Sparse neighborhood covers

The r-neighborhood of a vertex u is defined as $N_r(u) = \{v \mid c(u,v) \le r\}$. Recall that $c(u,v)$ is the distance
between u and v in the network. A sparse r-neighborhood cover is a collection of sets (also called clusters)
of vertices $S_1, \ldots, S_l$ with the following properties:

1. for every vertex v there exists some $1 \le i \le l$ such that $N_r(v) \subseteq S_i$;

2. the diameter of each cluster $S_i$ (the diameter of the subgraph induced by $S_i$), $1 \le i \le l$, is in $O(r \log n)$;
and

3. each node belongs to $O(\log n)$ clusters.

A sparse neighborhood covers data structure is a family of sparse r-neighborhood covers for different
values of r. Applications often require the construction of sparse r-neighborhood covers for $O(\log(Diam(G)))$
successively doubled values, namely r = 1, 2, 4, 8, .... Here G is our network graph.

The algorithm for finding a sparse r-neighborhood cover is as follows.

• Initially all the nodes are unmarked.

• Repeat the following until all nodes are removed.

1. Pick an unmarked node u.

2. Find the smallest j such that $2|N_{jr}(u)| \ge |N_{(j+1)r}(u)|$.

3. Either such a $j < \log n$ exists or $N_{r \log n}(u)$ includes all nodes; in the latter case, set $j = \log n$.

4. Include the set $N_{(j+1)r}(u)$ in the cover.

5. Mark all nodes in $N_{(j+1)r}(u)$ and remove all nodes in $N_{jr}(u)$ from further consideration.

6. If there exists any unmarked node u, go to step 1; otherwise first unmark all nodes and then go
to step 1.

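One way to read the steps above in code is the region-growing sketch below. It rests on assumptions the notes do not spell out: the growth test counts only nodes that have not yet been removed, it stops at the first j for which doubling fails (non-strict comparison), and the emitted cluster is the ball over all nodes, so that the full r-neighborhood of every removed node lies inside it. The sketch guarantees the covering property and the diameter bound; it does not attempt to establish the O(log n) membership bound. The function name and parameters are illustrative.

```python
import math

def sparse_cover(nodes, dist, r):
    """Region-growing sketch; returns a list of clusters (sets of nodes)."""
    nodes = list(nodes)
    if not nodes:
        return []
    max_j = max(1, math.ceil(math.log2(len(nodes))))
    remaining = set(nodes)          # nodes not yet removed
    marked = set()
    cover = []

    def ball(center, radius, pool):
        return {v for v in pool if dist(center, v) <= radius}

    while remaining:
        unmarked = remaining - marked
        if not unmarked:            # step 6: unmark everything, start a new phase
            marked.clear()
            continue
        u = min(unmarked)           # step 1: any unmarked node would do
        j = 1                       # steps 2-3: grow while the ball more than doubles
        while (j < max_j and
               2 * len(ball(u, j * r, remaining)) < len(ball(u, (j + 1) * r, remaining))):
            j += 1
        cluster = ball(u, (j + 1) * r, nodes)   # step 4 (ball over all nodes: assumption)
        cover.append(cluster)
        marked |= cluster                        # step 5: mark the cluster ...
        remaining -= ball(u, j * r, remaining)   # ... and remove its inner ball
    return cover
```

By the triangle inequality, any node v removed with center u satisfies c(u,v) ≤ jr, so every node within distance r of v is within (j+1)r of u and therefore belongs to the emitted cluster.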
First we observe that, when a node v is removed, its neighborhood $N_r(v)$ is included in some cluster (see Figure 5).
Using the fact that in each cluster at least half of the nodes will be removed, one can prove that each
node will be in at most $O(\log n)$ clusters (see the details in [ABCP99]). Linal and Saks [LS93] showed how a
distributed randomized algorithm can be used to compute sparse covers. In this approach, each node u starts
with $N_r(u)$ as a cluster. Then in each round, every cluster grows simultaneously, but some clusters stop
because of the others (ties are broken randomly). The running time of this algorithm is polylogarithmic.
The reader is referred to the original paper for more details.



Figure 5: Growing regions to obtain a sparse neighborhood cover

Now, we assume that we have a sparse neighborhood covers data structure which contains a cover $M_{2^i}$ for all
$1 \le i \le \log(Diam(G))$. Our solution for data tracking is based on the hierarchy of covers in the network. For each
cluster S in each cover $M_{2^i}$, we elect a leader $l(S)$ and provide internal routing services by constructing a
tree routing component for S rooted at $l(S)$. For each i, we associate with every node u a home cluster,
$home_i(u) \in M_{2^i}$, which is a cluster containing $N_{2^i}(u)$. Thus each node has $\log(Diam(G))$ home clusters.

Now, consider a find(u,x) operation. Node u first tries the lowest-level cover $M_{2^1}$, i.e. it forwards its
request for object x to its first home cluster leader, $l(home_1(u))$. If this trial fails, i.e. $l(home_1(u))$ does
not have any control information about x, u retries, this time using cover $M_{2^2}$, and so
on, until it finally succeeds. Suppose a node v nearest to u which contains x has distance d to u. Since v is
contained in $N_{2^{\lceil \log d \rceil}}(u)$, it belongs to the home cluster of u at level $\lceil \log d \rceil$, whose leader was informed
of the copy when it was inserted, so the search succeeds at that level at the latest. Since the diameter of each
cluster in $M_{2^i}$ is $O(2^i \log n)$, we have:

$$\text{cost of find}(u,x) = O\Big(\sum_{i=1}^{\lceil \log d \rceil} 2^i \log n\Big) = O(d \log n).$$

Thus the stretch of find is $O(\log n)$.

For insert(u,x) or delete(u,x), node u informs the leader of each cluster containing u in $M_{2^i}$ for all
$1 \le i \le \log(Diam(G))$. The worst-case cost for these two operations is in $O(Diam(G)\,\mathrm{polylog}(n))$; however,
an amortized stretch of $O(\mathrm{polylog}(n))$ can be achieved [BFR95]. The cost of join and leave operations is in
$\Omega(n)$, since we need to update all nodes in the worst case. Finally, since some leader nodes need to store
control information of all objects, the memory overhead of this algorithm is in $\Omega(m)$. In the next subsection
we present an algorithm with a better memory overhead.

2.3 A Flat Tracking Scheme 

If, for each object in the system, we maintain a logical tree that maps randomly to the nodes (hopefully
respecting locality), then the set of objects will be evenly distributed among the nodes and we will have
good load balance. However, there continues to be a scalability problem in that each node must know
its neighbors in the tree for every possible object tree. To address this, we present a simple flat tracking
scheme that uses a randomized embedding of logical trees to achieve static load balance while requiring
low memory overhead. For a restricted class of cost functions, it also achieves asymptotically efficient cost.
This scheme forms the data location component of Oceanstore, a data tracking system developed at Berkeley.

In this scheme, unique IDs are assigned to nodes and objects. The node whose ID matches the longest
prefix of the object ID becomes the "root" of that object's tree and is responsible for the information necessary to
locate at least one copy of the object. The idea is to route by successively matching a longer prefix
of the object ID from node to node until we arrive at a node that knows about a copy of the object searched
for. Note that random node IDs mean that the tracking scheme is topology-sensitive.




Figure 6: Object access tree 

As shown in Figure 6, the parent of a node u is the closest node (by some network metric, e.g. latency)
whose ID matches the object's ID in a longer prefix than u's ID. For example, node 011 matches 000 in a
prefix of length 1, better than node 110, which matches no prefix at all. As 011 is the closest such node to
110, it becomes the parent of 110 in the access tree for 000. Similarly, node 000 (the root for object 000) happens
to be closest to 010, and so becomes the direct parent of 010, so that 010 is able to get to object 000
in one hop.

Figure 7: Overlapping neighbors

Clearly, as shown in Figure 7, there will be overlap among the neighbors of a given node across different
access trees. In fact, with a little thought, one can see that maintaining the closest node whose ID matches
the object's ID in a longer prefix for all objects is equivalent to simply maintaining the closest node whose
ID matches the current node's ID in the first i bits and differs in the (i+1)st bit (assuming n is a power of
2). Thus, the total number of distinct neighbors a node has to store is only log n. A typical neighbor table
is shown in Figure 8.



Figure 8: Node neighbor table (the neighbor table for node x, with one row per level, from level (log n - 1) at the top through level i to the lowest level)

In addition to neighbors, a node also maintains object pointers to objects that it knows about. An object 
request is routed from node to node, successively matching longer node ID prefixes, until it reaches one that 
has a pointer to the object. If the object is in the system at all, then the root node for that object will point 
to it, so that the lookup is guaranteed to terminate. Any pointers not at the root act like cached queries: they
sometimes remove the need to go all the way to the root.

When a node u inserts an object copy into the system, it propagates the location of the copy through the 
object's search path, leaving an object pointer at each node along the path that is unaware of the object. If 
it encounters a node w that points to an existing copy that is closer to w than u, it changes nothing at w 
and the propagation ends. Otherwise, if u is closer, it updates the pointer at w and continues the propagation. 

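The neighbor table and the prefix-matching route can be sketched as follows. The sketch covers routing only (object pointers are left out) and assumes that node and object IDs are bit strings of the same length; the helper names, the `dist` callback and the example positions used to test it are illustrative assumptions.

```python
def common_prefix_len(a, b):
    """Number of leading characters on which a and b agree."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def build_neighbor_table(x, ids, dist):
    """table[i] = nearest node sharing x's first i bits and differing in bit i, or None."""
    table = []
    for i in range(len(x)):
        candidates = [y for y in ids if y[:i] == x[:i] and y[i] != x[i]]
        table.append(min(candidates, key=lambda y: dist(x, y), default=None))
    return table

def route(start, object_id, tables):
    """Follow neighbor tables, matching at least one more bit of object_id per hop."""
    path = [start]
    cur = start
    while cur != object_id:
        i = common_prefix_len(cur, object_id)
        nxt = tables[cur][i]
        if nxt is None:        # no node matches a longer prefix: cur acts as the root
            break
        path.append(nxt)
        cur = nxt
    return path
```

With binary IDs, the level-i neighbor differs from the current node in bit i and therefore agrees with the object ID there, so each hop lengthens the matched prefix and a route takes at most as many hops as there are bits.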



(Figure 8 annotation: the level-i entry of the neighbor table for node x is the node y nearest to x such that
y[0..i-1] = x[0..i-1] and y[i] differs from x[i].)

This flat tracking scheme has several advantages. The randomized ID assignment distributes the set of
object pointers evenly over the nodes. The lookup scheme keeps each neighbor table at size log n. Thus,
the system scales well. Under certain assumptions about communication cost functions, the system can also
be shown to be efficient in that the expected access cost is within a constant factor of optimal. In particular,
we require that for every node x and real r > 1, the ratio of the number of nodes within distance (cost) 2r
of x to the number of nodes within distance r is bounded from above and below by constants [PRR99].

There are also several limitations. As mentioned above, the efficiency proofs depend on restricted cost
functions. The scheme does not take into consideration the dynamic load on nodes, and the overhead of forwarding
requests through several nodes may be significant. The system has no distributed scheme for dynamic node
joins and leaves. Furthermore, the number of nodes affected by a join or leave may be large from a practical
standpoint.

3 The Load Assignment Problem 

(Notes essentially from Adrian Vetta's presentation)

Input:

• A set $S = \{s_1, s_2, \ldots, s_n\}$ of sources.

• Associated with a source $s_i$ is a load of size $l_i$.

• A set $T = \{t_1, t_2, \ldots, t_r\}$ of sinks.

• Associated with a sink $t_j$ is a cost function $c_j()$.

• There is an edge ij if a load from $s_i$ may be routed to $t_j$.

Output: A minimum cost assignment of the loads to sinks.

A solution is an assignment $x = \{x_1, x_2, \ldots, x_m\}$ where $x_e$ is the load on edge e.

The Cost Functions

Given a source $s_i$, let $\Gamma_i$ be the set of sinks incident to $s_i$. Given a sink $t_j$, let $\Gamma_j$ be the set of sources
incident to $t_j$. Let $X_j = \sum_{i \in \Gamma_j} x_{ij}$ be the total load at sink $t_j$.

For a sink $t_j$ we will assume that the cost function $c_j()$:

• Is a function of $X_j$, i.e. $c_j(X_j)$.

• Has decreasing marginal costs $c'_j(X_j)$.

Note that this second property is just concavity, since $c''_j(X_j) \le 0$.

Figure 9: Load assignment problem

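A Mathematical Formulation

\begin{align*}
\min \quad & \sum_{t_j \in T} c_j(X_j) \\
\text{s.t.} \quad & \sum_{t_j \in \Gamma_i} x_{ij} = l_i && \forall s_i \in S \\
& x_{ij} \ge 0 && \forall ij \in E
\end{align*}
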
3.1 A Hardness Result 

Set Cover

Input: Elements $\{v_1, v_2, \ldots, v_n\}$ and sets $\{S_1, S_2, \ldots, S_r\}$.

Output: A collection of sets of minimum cardinality that covers every element.

It is known that, under standard complexity assumptions, the Set Cover problem cannot be approximated to within an $o(\log n)$ factor.

Reduction to Load Assignment

There is an approximation-preserving reduction from Set Cover to the Load Assignment problem:

• There is a source $s_i$ for each element $v_i$.

• There is a sink $t_j$ for each set $S_j$.

• There is an edge ij if the element $v_i$ is in the set $S_j$.

• The load $l_i$ is one for each source.

• The marginal costs at sink j are (1, 0, 0, ..., 0).

Theorem. Under the same assumptions, the Load Assignment problem cannot be approximated to within an $o(\log n)$ factor. ∎

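The reduction can be written out directly. In the sketch below, the function names and the dictionary-based instance format are illustrative assumptions; the only substantive point is that marginal costs (1, 0, 0, ..., 0) make a sink cost 1 as soon as it carries any load, so the cost of an assignment equals the number of sets it uses.

```python
def set_cover_to_load_assignment(elements, sets):
    """Build (edges, loads, cost) of a Load Assignment instance from a Set Cover instance."""
    edges = {(v, name) for name, members in sets.items() for v in members}
    loads = {v: 1 for v in elements}            # unit load at every source
    # Marginal costs (1, 0, 0, ...): the first unit costs 1, further units are free.
    cost = {name: (lambda load: 1 if load > 0 else 0) for name in sets}
    return edges, loads, cost

def assignment_cost(assignment, cost):
    """Total cost of an assignment {source: sink} with unit loads."""
    load = {}
    for sink in assignment.values():
        load[sink] = load.get(sink, 0) + 1
    return sum(cost[sink](k) for sink, k in load.items())
```

A feasible assignment picks, for every element, one set containing it, so minimizing the assignment cost is the same as minimizing the number of sets in a cover.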
3.2 The Structure of a Solution 

Observation

We may view solutions as permutations of {1, 2, ..., r}. To see this, note that some optimal solution x has
the following two properties:

Property I. For any source $s_i$, all of its load $l_i$ is assigned to a single sink.

Proof. Suppose not. Let some of the load be assigned to the sink $t_1$ and some to the sink $t_2$. Now if
$c'_1(X_1) \le c'_2(X_2)$ then, by concavity, the total cost does not increase when the load is re-routed from $t_2$ to $t_1$.
Similarly if $c'_2(X_2) < c'_1(X_1)$. ∎

From now on we may assume that the load at each source is one.

3.3 Saturated Sinks 

Sink $t_j$ is saturated if each source in $\Gamma_j$ assigns its load to $t_j$.

Property II. At least one sink is saturated.

Proof. Suppose not. Then, by concavity,



Consider a directed graph with a node for each sink, and arcs $(j, f(j))$. Then, as no sink is saturated,
each vertex has out-degree one. Thus the graph contains a cycle C. Summing around the cycle, we obtain

Let k be the last node in the cycle, so that f(k) is the node we started out from. The above telescopes to
give

which is a contradiction. ∎

WLOG assume that $t_1$ is saturated. Then by similar arguments:

$\exists\, t_2$ that is saturated with respect to $S - \Gamma_1$,

$\exists\, t_3$ that is saturated with respect to $S - \Gamma_1 - \Gamma_2$, etc.

3.4 A Greedy Algorithm 

Greedy

Calculate the average saturated cost of each sink, i.e. $c_j(|\Gamma_j|) / |\Gamma_j|$.

Assign $\Gamma_{j^*}$ to $t_{j^*}$, the sink with the minimum average saturated cost.

Repeat on $S - \Gamma_{j^*}$ until each source is covered.

Running Time

The running time of the greedy algorithm is linear in the number of edges. It takes O(m) time to calculate
the initial average saturated costs and O(m) time, over all subsequent iterations, to update these average
saturated costs.

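The greedy rule can be sketched as follows for unit loads. The instance format (`adj` mapping each sink to the set of sources that may use it, `cost` mapping each sink to a function of its load) and the function name are illustrative assumptions. For brevity the sketch rescans every sink in each round, so it does not achieve the linear running time described above.

```python
def greedy_assign(adj, cost):
    """Repeatedly saturate the sink with minimum average saturated cost.

    adj[j]  : set of sources that may be routed to sink j
    cost[j] : function giving the cost of sink j as a function of its load
    Returns (assignment {source: sink}, total cost).
    """
    uncovered = set().union(*adj.values())
    assignment, total = {}, 0
    while uncovered:
        best_j, best_avg = None, None
        for j, sources in adj.items():
            avail = sources & uncovered            # sources j would serve if saturated now
            if not avail:
                continue
            avg = cost[j](len(avail)) / len(avail)  # average saturated cost
            if best_avg is None or avg < best_avg:
                best_j, best_avg = j, avg
        avail = adj[best_j] & uncovered
        for s in avail:
            assignment[s] = best_j
        total += cost[best_j](len(avail))
        uncovered -= avail
    return assignment, total
```

On an instance derived from Set Cover, this is the familiar greedy rule of repeatedly choosing the set that covers the most uncovered elements.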
3.5 Analysis of Greedy Algorithm 

Let the optimal solution be $T^* = \{t^*_1, t^*_2, \ldots, t^*_\ell\}$, with $c(T^*) = \mathrm{OPT}$.

Let the greedy solution be $T = \{t_1, t_2, \ldots, t_s\}$, where the sink chosen in step i covers $n_i$ sources.

Theorem. Greedy is an $O(\log n)$-approximation algorithm.

Proof. Note that $\mathrm{OPT}/n$ is just a weighted average of sink average costs. Thus, we see that

At step 1 some sink in $T^*$ has average cost $\le \dfrac{\mathrm{OPT}}{n}$.

At step 2 some sink in $T^*$ has average cost $\le \dfrac{\mathrm{OPT}}{n - n_1}$.

...

At step s some sink in $T^*$ has average cost $\le \dfrac{\mathrm{OPT}}{n - n_1 - \cdots - n_{s-1}}$.

Now, by concavity, the average saturated cost of a sink is at most the average cost. Thus, the greedy
algorithm has a total cost

\begin{align*}
c(T) &\le n_1 \frac{\mathrm{OPT}}{n} + n_2 \frac{\mathrm{OPT}}{n - n_1} + \cdots + n_s \frac{\mathrm{OPT}}{n - n_1 - \cdots - n_{s-1}} \\
&\le \Big(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\Big)\,\mathrm{OPT} \\
&= H_n\,\mathrm{OPT} = O(\log n)\,\mathrm{OPT}
\end{align*}

where in the second inequality we've used the fact that

$$\frac{n_1}{n} \le \frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{n - n_1 + 1}.$$

3.6 Some Special Cases 

Concave Marginal Cost Functions 

The actual performance guarantee of greedy is $1/\alpha$, where

$$\alpha = \min_j \frac{c_j(|\Gamma_j|)}{c'_j(0)} \cdot \frac{1}{|\Gamma_j|} \le \min_j \frac{c_j(|\Gamma_j|)}{c_j(1)} \cdot \frac{1}{|\Gamma_j|}.$$

As a corollary, for example, if the marginal cost function is concave then we have a 2-approximation algorithm.

Constant Number of "Tricky" Sinks 

If there is only a fixed number k of sinks with non-constant marginal cost functions then we can solve 
the problem optimally. 

Convex Cost Functions 

If the cost function is convex then the optimal solution can be found by a minimum cost flow algorithm. 

3.7 Open Problems 

Problem I 

What if there are capacities at the sinks? 

Problem II 

What if the cost functions are more complex? 

Problem III 

What if there is added structure regarding source-sink links?

4 Another Open Problem: Online Assignment

This is essentially the same as the load assignment problem, with two key differences: the assignment must
be performed online, and assigning a unit of load from client i to server j incurs cost c(i, j).

One might consider two possible server load models: a capacity $C_j$ for server j, or a function $f_j$ of the load
that gives the additional cost for each unit of load served by j.



Figure 10: Online load assignment